Whats Covered

Aditional Resources


Filtering joins and set operations


Semi-joins

Apply a semi-join

As you saw, semi-joins provide a concise way to filter data from the first dataset based on information in a second dataset.

For example, the code in the editor uses semi_join() to create a data frame of the artists in artists who have written a song in songs.

# View the output of semi_join()
new_artists <- 
  artists %>% 
  semi_join(songs, by = c("first", "last")) 

# Create the same result
new_artists <- 
  artists %>% 
  right_join(songs, by = c("first", "last")) %>% 
  filter(!is.na(instrument)) %>% 
  select(first, last, instrument) 

new_artists %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last instrument
John Lennon Guitar
Paul McCartney Bass
Tom Jones Vocals

You can see that a single semi-join is the equivalent of three separate operations.

Exploring with semi-joins

Semi-joins provide a useful way to explore the connections between multiple tables of data.

For example, you can use a semi-join to determine the number of albums in the albums dataset that were made by a band in the bands dataset.

albums %>% 
  # Collect the albums made by a band
  semi_join(bands, by = c("band")) %>% 
  # Count the albums made by a band
  nrow() %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
x
5

A more precise way to filter?

  • You can use semi_joins with a df of date you want to match to easily filter
tracks
## # A tibble: 21 x 6
##    track                     band            label      year first  last   
##    <chr>                     <chr>           <chr>     <int> <chr>  <chr>  
##  1 Can't Buy Me Love         The Beatles     Parlopho~  1964 Paul   McCart~
##  2 I Feel Fine               The Beatles     Parlopho~  1964 John   Lennon 
##  3 A Hard Day's Night        The Beatles     Parlopho~  1964 John   Lennon 
##  4 Sound of Silence          Simon and Garf~ Columbia   1964 Paul   Simon  
##  5 Help!                     The Beatles     Parlopho~  1965 John   Lennon 
##  6 Ticket to Ride            The Beatles     Parlopho~  1965 John   Lennon 
##  7 I am a Rock               Simon and Garf~ Columbia   1965 Paul   Simon  
##  8 Yellow Submarine / Elean~ The Beatles     Parlopho~  1966 Paul   McCart~
##  9 Homeward Bound            Simon and Garf~ Columbia   1966 Paul   Simon  
## 10 Scarborough Fair          Simon and Garf~ Columbia   1966 unkno~ unknown
## # ... with 11 more rows
matches
## # A tibble: 3 x 3
##   band                 year first
##   <chr>               <int> <chr>
## 1 The Beatles          1964 Paul 
## 2 The Beatles          1965 John 
## 3 Simon and Garfunkel  1966 Paul
# With semi-join
tracks %>% 
  semi_join(matches,by = c("band", "year", "first")) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
track band label year first last
Can’t Buy Me Love The Beatles Parlophone 1964 Paul McCartney
Help! The Beatles Parlophone 1965 John Lennon
Ticket to Ride The Beatles Parlophone 1965 John Lennon
Homeward Bound Simon and Garfunkel Columbia 1966 Paul Simon
# With dply filter statement
tracks %>% 
  filter(
    (band == "The Beatles" & year == 1964 & first == "Paul") |
    (band == "The Beatles" & year == 1965 & first == "John") |
    (band == "Simon and Garfunkel" & year == 1966 & first == "Paul")
  ) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
track band label year first last
Can’t Buy Me Love The Beatles Parlophone 1964 Paul McCartney
Help! The Beatles Parlophone 1965 John Lennon
Ticket to Ride The Beatles Parlophone 1965 John Lennon
Homeward Bound Simon and Garfunkel Columbia 1966 Paul Simon

The filter does return the same observations as the semi-join, but it takes a bit of work to set things up properly, as you can see.

Anti-joins

  • Full list of joins available; anti-join can be used to see which rows will not be matched to a second dataset by a join.

Apply an anti-join

Anti-joins provide a useful way to reason about how a mutating join will work before you apply the join.

# Return rows of artists that don't have bands info
artists %>% 
  anti_join(bands, by = c("first", "last")) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last instrument
Tom Jones Vocals
Davy Jones Vocals
Joe Perry Guitar
Elvis Presley Vocals
Paul Simon Guitar
Joe Walsh Guitar
Brian Wilson Vocals
Nancy Wilson Vocals

Have you noticed that correct keys are as important to filtering joins as they were to mutating joins?

Apply another anti-join

Anti-joins with anti_join() also provide a great way to diagnose joins that go wrong.

For example, they can help you zero-in on rows that have capitalization or spelling errors in the keys. These things will make your primary and secondary keys appear different to R, even though you know they refer to the same thing.

# Check whether album names in labels are mis-entered
labels %>% 
  anti_join(albums, by = c("album"))
## # A tibble: 1 x 2
##   album             label     
##   <chr>             <chr>     
## 1 A Hard Days Night Parlophone

You can think of anti-join as a debugging tactic for joins. The next exercise will help you decide when to use anti_join() and when to use semi_join().

Which filtering join?

Think you have filtering joins down? Let’s check.

Which filtering join would you use to determine how many rows in songs match a label in labels?

# Determine which key joins labels and songs
labels %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
album label
Abbey Road Apple
A Hard Days Night Parlophone
Magical Mystery Tour Parlophone
Led Zeppelin IV Atlantic
The Dark Side of the Moon Harvest
Hotel California Asylum
Rumours Warner Brothers
Aerosmith Columbia
Beggar’s Banquet Decca
songs %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song album first last
Come Together Abbey Road John Lennon
Dream On Aerosmith Steven Tyler
Hello, Goodbye Magical Mystery Tour Paul McCartney
It’s Not Unusual Along Came Jones Tom Jones
# Check your understanding
songs %>% 
  # Find the rows of songs that match a row in labels
  semi_join(labels, by = c("album")) %>% 
  # Number of matches between labels and songs
  nrow() %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
x
3

Set operations

  • union will return every row that appears in one or more of the datasets
    • If a row appears multiple times union will only return it once
  • interesect will return only the rows that appear in both datasets
  • setdiff will return the rows that appear in the first dataset but not the second

How many songs are there?

We have loaded two datasets in your workspace, aerosmith and greatest_hits, each of which represents an album from the band Aerosmith. Each row in either of the datasets is a song on that album.

How many unique songs do these two albums contain in total?

aerosmith %>% 
  # Create the new dataset using a set operation
  union(greatest_hits) %>% 
  # Count the total number of songs
  nrow() %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
x
24

The union() function provides an easy way to combine two datasets without duplicating any values.

Greatest hits

Which songs from Aerosmith made it onto Greatest Hits?

# Create the new dataset using a set operation
aerosmith %>% 
  intersect(greatest_hits) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song length
Dream On 04:28:00

You can think of intersect() as the set operator equivalent of a semi-join. It is what you would use if your datasets contain the exact same variables.

Live! Bootleg songs

live contains the songs from a third Aerosmith album, Live! Bootleg.

Which songs are on Live! Bootleg but not on Greatest Hits? Notice that the length of songs may be different when they are performed live.

# Select the song names from live
live_songs <- live %>% select(song)

# Select the song names from greatest_hits
greatest_songs <- greatest_hits %>% select(song)

# Create the new dataset using a set operation
live_songs %>% 
  setdiff(greatest_songs)  %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song
Lord of the Thighs
Toys in the Attic
Sick as a Dog
Sight for Sore Eyes
S.O.S. (Too Bad)
I Ain’t Got You
Mother Popcorn/Draw the Line
Train Kept A-Rollin’/Strangers in the Night

Multiple operations

There is no set operation to find rows that appear in one data frame or another, but not both. However, you can accomplish this by combining set operators.

Can you think of a combination that would answer the question, “Which songs appear on one of Live! Bootleg or Greatest Hits, but not both?”

  • You’ll need setdiff(), intersect(), and union(), but not necessarily in that order.
# Select songs from live and greatest_hits
live_songs <- live %>% select(song)
greatest_songs <- greatest_hits %>% select(song)

# Find songs in at least one of live_songs and greatest_songs
all_songs <- union(live_songs, greatest_songs)

# Find songs in both 
common_songs <- intersect(live_songs, greatest_songs)

# Find songs that only exist in one dataset
setdiff(all_songs, common_songs) %>% 
  # Create pretty table
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song
Lord of the Thighs
Toys in the Attic
Sick as a Dog
Sight for Sore Eyes
S.O.S. (Too Bad)
I Ain’t Got You
Mother Popcorn/Draw the Line
Train Kept A-Rollin’/Strangers in the Night
Same Old Song and Dance
Seasons of Winter
Big Ten Inch Record
Draw the Line
Kings and Queens
Remember (Walking in the Sand)
Lightning Strikes
Sweet Emotion (remix)
One Way Street (live)

Uncommonly good uncovering of uncommon songs! Working with set operators is fairly simple; it doesn’t get much trickier than this. However, set operators do have one type of odd behavior, which you will learn about in the next exercise.

Unique values

eagles1974 contains five rows, one for each member of the Eagles (a band) in 1974. Likewise, eagles1971 contains four rows, one for each member of the band in 1971.

Examine the contents of each dataset, then determine how many rows will be in the union of eagles1974 with eagles1971. Does something surprising happen here?

eagles1974

eagles1971

union(eagles1974, eagles1971)

How can a union return fewer values than are in the original dataset? The union removes the duplicate row for Don. The rows represent two different Dons, Don Felder and Don Henley, but there’s nothing in eagles1974 to indicate this.

Comparing datasets

  • Its common to want to know if one data set is the same as another dataset
  • dplyr’s setequal will do this easily
  • base R’s identical is will only return true if the datasets have the exact same rows in the exact same order

Recap:

Apply setequal

definitive and complete contain the songs that appear in competing Led Zeppelin anthologies: The Definitive Collection and The Complete Studio Recordings, respectively.

Both anthologies claim to contain the complete studio recordings of Led Zeppelin, but do the anthologies contain the same exact songs?

# Check if same order: definitive and complete
identical(definitive, complete)
## [1] FALSE
# Check if any order: definitive and complete
setequal(definitive, complete)
## [1] FALSE
# Songs in definitive but not complete
setdiff(definitive, complete)
## # A tibble: 15 x 2
##    song                        album                    
##    <chr>                       <chr>                    
##  1 Rock and Roll               The Song Remains the Same
##  2 Celebration Day             The Song Remains the Same
##  3 Black Dog                   The Song Remains the Same
##  4 Over the Hills and Far Away The Song Remains the Same
##  5 Misty Mountain Hop          The Song Remains the Same
##  6 Since I've Been Loving You  The Song Remains the Same
##  7 No Quarter                  The Song Remains the Same
##  8 The Song Remains the Same   The Song Remains the Same
##  9 The Rain Song               The Song Remains the Same
## 10 The Ocean                   The Song Remains the Same
## 11 Dazed and Confused          The Song Remains the Same
## 12 Stairway to Heaven          The Song Remains the Same
## 13 Moby Dick                   The Song Remains the Same
## 14 Heartbreaker                The Song Remains the Same
## 15 Whole Lotta Love            The Song Remains the Same
# Songs in complete but not definitive
setdiff(complete, definitive)
## # A tibble: 0 x 2
## # ... with 2 variables: song <chr>, album <chr>

The Definitive Collection contains the soundtrack for The Song Remains the Same, a documentary not featured in The Complete Studio Recordings.

Apply setequal again

A few exercises ago, you saw that an intersect() is analagous to a semi_join() when two datasets contain the same variables and each variable is used in the key.

Under these conditions, setdiff() is also analagous to one of the filtering joins.

# Return songs in definitive that are not in complete
definitive %>% 
  anti_join(complete, by = c("song", "album")) %>% 
  # Create pretty table
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song album
Rock and Roll The Song Remains the Same
Celebration Day The Song Remains the Same
Black Dog The Song Remains the Same
Over the Hills and Far Away The Song Remains the Same
Misty Mountain Hop The Song Remains the Same
Since I’ve Been Loving You The Song Remains the Same
No Quarter The Song Remains the Same
The Song Remains the Same The Song Remains the Same
The Rain Song The Song Remains the Same
The Ocean The Song Remains the Same
# Return songs in complete that are not in definitive
complete %>% 
  anti_join(definitive, by = c("song", "album")) %>% 
  # Create pretty table
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song album

When your datasets contain the same variables, a setdiff() does the same thing as an anti_join() that uses every column as a key.

Comparing albums

It appears that The Definitive Collection contains songs from the soundtrack of The Song Remains the Same, a movie filmed during a live Led Zeppelin concert. Is this the only difference between The Definitive Collection and The Complete Studio Recordings?

nrow(complete)
## [1] 85
nrow(definitive)
## [1] 100
# Get union of complete and soundtrack
complete_and_soundtrack <- union(complete, soundtrack)

# Check if same, including order: definitive and union of complete and soundtrack
complete_and_soundtrack %>%
  identical(definitive)
## [1] FALSE
# Check if same, rows in any order: definitive and union of complete and soundtrack
complete_and_soundtrack %>%
  setequal(definitive)
## [1] TRUE

Session info

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
## [5] LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.1.0    dplyr_0.8.0.1    gapminder_0.3.0  kableExtra_1.0.1
## [5] knitr_1.21      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        highr_0.7         plyr_1.8.4       
##  [4] pillar_1.3.1      compiler_3.5.2    prettydoc_0.2.1  
##  [7] tools_3.5.2       digest_0.6.18     gtable_0.2.0     
## [10] evaluate_0.12     tibble_2.0.1      viridisLite_0.3.0
## [13] pkgconfig_2.0.2   rlang_0.3.1       cli_1.0.1        
## [16] rstudioapi_0.9.0  yaml_2.2.0        xfun_0.4         
## [19] withr_2.1.2       httr_1.4.0        stringr_1.4.0    
## [22] xml2_1.2.0        hms_0.4.2         webshot_0.5.1    
## [25] grid_3.5.2        tidyselect_0.2.5  glue_1.3.0       
## [28] R6_2.4.0          fansi_0.4.0       rmarkdown_1.11   
## [31] readr_1.3.1       purrr_0.3.0       magrittr_1.5     
## [34] codetools_0.2-15  scales_1.0.0      htmltools_0.3.6  
## [37] assertthat_0.2.0  rvest_0.3.2       colorspace_1.4-0 
## [40] utf8_1.1.4        stringi_1.3.1     lazyeval_0.2.1   
## [43] munsell_0.5.0     crayon_1.3.4